A recent survey from the Bank of England and the Financial Conduct Authority found that over two thirds of UK financial services organisations have live machine learning applications, and their usage is expected to double within the next three years. Financial organisations are well placed to derive business value because they have access to the large and complex datasets needed to build a variety of predictive models - example use cases include predicting the probability of default on loans, detecting fraudulent transactions, predicting customer churn, anti-money laundering (AML) measures, purchasing-intention modelling and personalisation.
As businesses race to drive insight from their data, this highly regulated sector faces its own challenges: the need to protect sensitive personally identifiable information (PII) hinders collaboration, and the time needed to clean data and check it for compliance adds months to project timescales. Even once prepared, the dataset itself may be unrepresentative, making it ill suited to the accurate development and training of AI solutions. These data limitations are the most frequently cited barriers preventing finance organisations from utilising their data assets, with over 60% of company data remaining unused for analytics.
In the last three years, so-called synthetic data technologies have increasingly been adopted by leading insurers and consumer-facing financial services companies to solve a number of data provisioning and data preparation challenges. When implemented well, synthesized datasets yield the same analytical results as the originals, with the added benefits of full data privacy compliance and a major reduction in the time needed for product development and testing; synthesizing high-quality data can take as little as 10 minutes, even for large-scale datasets.
But this is just the tip of the iceberg. In this post we demonstrate how data rebalancing with the Synthesized core platform can quickly secure the stability and optimal performance of crucial machine learning models in banking and insurance.
Data Rebalancing with Applications in Personalised Marketing, Customer Segmentation and Fraud Detection
Customers prefer personalised financial services that match their needs and lifestyle, and businesses offering customer-facing financial services face the challenge of ensuring that digital communication with their customers meets these demands. Machine learning and advanced data science make highly personalised experiences possible by extracting insights from data that encapsulates consumers' preferences, interactions, behaviour, lifestyle details and interests. Successful personalisation of offers, policies and pricing makes a large contribution to the revenues of the business.
Marketing departments apply various techniques to grow the customer base and to support targeted marketing strategies, and customer segmentation plays a pivotal role in this process. Algorithms segment customers according to their financial sophistication, age, location and so on, classifying them into groups by spotting similarities in their attitudes, preferences, behaviour or personal information. As a result, targeted cross-selling policies can be developed and personal services tailored to each particular segment.
A major obstacle to building and validating marketing strategies is getting access to representative data about customer segments. Very often the most valuable information for the business is hidden in an under-represented customer category. For example, the online shoppers purchasing intention dataset contains 12,330 sessions, of which only 1,908 (15.47%) ended in a purchase, and the credit loan default dataset contains 663 (6.6%) defaulters out of 10,000 borrowers. If a predictive model is trained on such a biased dataset, its results will be biased too - in this scenario, the ensuing wrong decisions lead to a higher customer acquisition cost and a poor experience for those targeted with inappropriate offers, making them less likely to purchase at all.
How to Overcome Imbalances in Data?
A way to overcome this issue is to generate new samples for the under-represented category, thereby rebalancing the dataset, as illustrated in the sketch below.
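As a concrete illustration (not Synthesized's own generator, which is a proprietary generative model), the minimal sketch below rebalances a toy dataset with SMOTE from the open-source imbalanced-learn library. The feature matrix and the 6.6% minority rate are assumptions chosen to mirror the credit loan default example above.

```python
# Minimal sketch, assuming SMOTE as a stand-in for Synthesized's generator:
# create new minority-class samples until the classes are balanced 1:1.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)

# Toy stand-in for the credit dataset above: roughly 6.6% "defaulter" class.
X = rng.normal(size=(10_000, 8))
y = (rng.random(10_000) < 0.066).astype(int)
print("before:", Counter(y))          # heavily imbalanced

# Generate synthetic minority samples; the default strategy balances to 1:1.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_bal))      # equal class counts
```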
Data rebalancing is crucial for addressing these imbalances, and we provide a numerical illustration below. To evaluate the performance of models trained on rebalanced datasets we use the AUC (area under the ROC curve) score, as it is a widely used metric for imbalanced datasets.
To check how the minority class proportion affects the final results, the following procedure is carried out:
- We split the dataset into training and test sets with a 4:1 ratio.
- The training set is resampled from its original proportion up to 1:1, so that both classes have the same number of samples.
- We compute the evaluation metrics on the test set, which remains unseen throughout.
We compute the AUC as we resample the training data from its original proportion until both classes have the same number of samples. The results of 10 Monte Carlo simulations are shown in Figure 1: we can clearly observe an upward trend in the AUC score as the datasets are resampled. A sketch of this evaluation loop follows.
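The sketch below implements that evaluation loop under our own assumptions - a synthetic stand-in dataset from `make_classification`, a random forest classifier and SMOTE for resampling - none of which are necessarily the exact setup behind Figure 1.

```python
# Sketch of the evaluation loop: hold out a test set, resample the training
# set to progressively higher minority proportions, and track AUC.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

aucs = {p: [] for p in (0.1, 0.25, 0.5, 0.75, 1.0)}
for seed in range(10):                      # 10 Monte Carlo runs
    # Toy dataset with a ~6.6% minority class, mirroring the credit example.
    X, y = make_classification(n_samples=10_000, weights=[0.934],
                               random_state=seed)
    # 4:1 train/test split; the test set is never resampled.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    for p in aucs:
        # p is the minority:majority ratio after resampling (1.0 = balanced).
        sm = SMOTE(sampling_strategy=p, random_state=seed)
        X_rs, y_rs = sm.fit_resample(X_tr, y_tr)
        clf = RandomForestClassifier(random_state=seed).fit(X_rs, y_rs)
        aucs[p].append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

for p, scores in aucs.items():
    print(f"ratio {p:.2f}: mean AUC {np.mean(scores):.3f}")
```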
Figure 2 shows that the resampled dataset outperforms the original dataset. The privacy of the data is also protected: because the technology Synthesized uses to generate data ensures full compliance with data privacy regulations, a data scientist can still inspect and manipulate the data without coming into contact with any sensitive information about the users.
Figure 2: AUC and PR curves for the credit scoring dataset, before and after resampling the dataset using different techniques.
How Data Rebalancing Improves the Performance of Models
Furthermore, it is often critical to detect precisely where the algorithm is making wrong decisions, as the cost of a false negative can be huge compared with that of a false positive. Both the credit scoring and online shoppers purchasing datasets exemplify this: giving credit to a defaulter (a false negative) is much more costly than denying credit to a non-defaulter (a false positive), and similarly targeting a non-buyer is usually less expensive than losing a buyer.
Figure 3 sheds light on this matter, showing the confusion matrices for both datasets when a random forest model is trained on the original (left) and on the Synthesized-resampled (right) training sets. In the first case, the majority of errors are false negatives rather than false positives, while in the resampled case the number of false negatives is drastically reduced. Data rebalancing with Synthesized's platform significantly reduces exactly these costly errors; a minimal sketch of such a comparison follows.
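The toy comparison below reproduces the spirit of Figure 3 under the same assumptions as the earlier sketches (synthetic data, SMOTE and a random forest rather than Synthesized's actual pipeline): models trained on the original and on the rebalanced training set are scored on the same held-out test set, and the false-negative counts are compared.

```python
# Illustrative confusion-matrix comparison on assumed toy data, not the
# exact datasets behind Figure 3.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.934], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

for name, (Xf, yf) in {
    "original":   (X_tr, y_tr),
    "rebalanced": SMOTE(random_state=0).fit_resample(X_tr, y_tr),
}.items():
    clf = RandomForestClassifier(random_state=0).fit(Xf, yf)
    # For binary labels, ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
    print(f"{name:>10}: FN={fn:3d} FP={fp:3d} TP={tp:3d} TN={tn:4d}")
```

On runs like this, the rebalanced model typically trades a modest rise in false positives for a large drop in false negatives, which is the favourable trade when a missed defaulter or missed fraud case is the expensive error.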
In summary, we have shown how resampling an imbalanced dataset can heavily affect the performance of a machine learning model. Synthesized's data rebalancing feature is simple to use and is now part of the core product offering, giving users the ability to easily manipulate the distribution of variables, rebalance the dataset and increase model performance, making the project easier and more successful for the team in charge.
Learn More
Explore the performance of data rebalancing in other common industry-specific scenarios and review the data produced by the Synthesized data provisioning platform in detail by contacting our data experts at team@synthesized.io.
FAQs
What is data rebalancing and why is it important for financial services organisations?
Data rebalancing is the process of adjusting the distribution of classes in a dataset to address imbalances that can skew machine learning models. For financial services organisations, data rebalancing is crucial because it ensures that predictive models are trained on representative data, leading to more accurate and reliable outcomes. This is particularly important in scenarios such as credit scoring, fraud detection, and personalized marketing, where biased data can result in significant financial and reputational risks. By implementing data rebalancing, financial institutions can improve the performance and fairness of their AI-driven decisions, ultimately enhancing customer satisfaction and operational efficiency.
How does data rebalancing enhance the performance of machine learning models in detecting fraudulent transactions?
Data rebalancing enhances the performance of machine learning models in detecting fraudulent transactions by ensuring that the model is trained on a dataset that accurately represents both fraudulent and legitimate transactions. Without data rebalancing, models might be biased towards the majority class (usually legitimate transactions), leading to a higher rate of false negatives (fraudulent transactions being missed). By rebalancing the data, financial institutions can reduce these false negatives and improve the model's ability to correctly identify fraudulent activities. This leads to more robust fraud detection systems, which can save organisations significant amounts of money and protect their customers from fraud.
Can data rebalancing help in improving customer segmentation for targeted marketing campaigns?
Yes, data rebalancing can significantly improve customer segmentation for targeted marketing campaigns. By rebalancing datasets, financial services organisations can ensure that their customer segmentation models are trained on representative data, capturing the nuances of different customer groups. This leads to more accurate segmentation, allowing for highly personalized marketing strategies. For instance, if certain customer segments are underrepresented in the original dataset, data rebalancing can help generate synthetic samples to balance the dataset, ensuring that the segmentation model identifies and targets these groups effectively. This results in more precise and effective marketing efforts, driving higher engagement and conversion rates.
What are the challenges of implementing data rebalancing in financial services, and how can they be overcome?
Implementing data rebalancing in financial services can present several challenges, including data privacy concerns, the complexity of financial datasets, and the need for specialized expertise. Data privacy is a major concern, as financial institutions must comply with strict regulations when handling sensitive customer information. To overcome this, organisations can use synthetic data generation techniques that create realistic but anonymized datasets for rebalancing purposes. Additionally, the complexity of financial datasets requires sophisticated tools and algorithms to accurately rebalance the data. Using advanced data science platforms like Synthesized can simplify this process, providing user-friendly interfaces and automated rebalancing capabilities. Finally, addressing the need for specialized expertise involves investing in training and hiring skilled data scientists who can effectively implement and manage data rebalancing strategies.